AITopics | large-scale evaluation

Collaborating Authors

large-scale evaluation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Neural Information Processing SystemsJun-14-2026, 06:10:53 GMT

Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications.

data attribution method, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Add feedback

Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift

Neural Information Processing SystemsDec-25-2025, 13:07:54 GMT

Bayesian deep learning (BDL) is a promising approach to achieve well-calibrated predictions on distribution-shifted data. Nevertheless, there exists no large-scale survey that evaluates recent SOTA methods on diverse, realistic, and challenging benchmark tasks in a systematic manner. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection containing challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large, convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over-or underconfident, providing further insight into the behavior of the methods. Further, we provide the first systematic evaluation of BDL for fine-tuning large pre-trained models, where training from scratch is prohibitively expensive. Finally, given the recent success of Deep Ensembles, we extend popular single-mode posterior approximations to multiple modes by the use of ensembles. While we find that ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, we also identify a failure mode of ensembles when finetuning large transformer-based language models. In this setting, variational inference based approaches such as last-layer Bayes By Backprop outperform other methods in terms of accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.

bayesian deep learning, deep ensemble, large-scale evaluation, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

Add feedback

On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion

Shirvani-Mahdavi, Nasim, Akrami, Farahnaz, Li, Chengkai

arXiv.org Artificial IntelligenceJun-11-2025

Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples, contradicting the fundamental goals of link prediction. These metrics often compress accuracy assessment into a single value, obscuring models' specific strengths and weaknesses. The prevailing evaluation protocol, link prediction, operates under the unrealistic assumption that an entity's properties, for which values are to be predicted, are known in advance. While alternative protocols such as property prediction, entity-pair ranking, and triple classification address some of these limitations, they remain underutilized. Moreover, commonly used datasets are either faulty or too small to reflect real-world data. Few studies examine the role of mediator nodes, which are essential for modeling n-ary relationships, or investigate model performance variation across domains. This paper conducts a comprehensive evaluation of four representative KGE models on large-scale datasets FB-CVT-REV and FB+CVT-REV. Our analysis reveals critical insights, including substantial performance variations between small and large datasets, both in relative rankings and absolute metrics, systematic overestimation of model capabilities when n-ary relations are binarized, and fundamental limitations in current evaluation protocols and metrics.

artificial intelligence, dataset, prediction, (15 more...)

arXiv.org Artificial Intelligence

2504.0897

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry: Government > Regional Government > North America Government > United States Government (0.67)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.94)

Add feedback

Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift

Neural Information Processing SystemsJan-18-2025, 16:10:08 GMT

bayesian deep learning, deep ensemble, large-scale evaluation, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Xie, Yong, Zheng, Weijie, Huang, Hanxun, Ye, Guangnan, Ma, Xingjun

arXiv.org Artificial IntelligenceNov-27-2024

As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.

artificial intelligence, machine learning, robustness, (17 more...)

arXiv.org Artificial Intelligence

2411.1521

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Cost-Control in Display Advertising: Theory vs Practice

Katti, Anoop R, Gonçalves, Rui C., Iakovlev, Rinchin

arXiv.org Artificial IntelligenceSep-5-2024

In display advertising, advertisers want to achieve a marketing objective with constraints on budget and cost-per-outcome. This is usually formulated as an optimization problem that maximizes the total utility under constraints. The optimization is carried out in an online fashion in the dual space - for an incoming Ad auction, a bid is placed using an optimal bidding formula, assuming optimal values for the dual variables; based on the outcome of the previous auctions, the dual variables are updated in an online fashion. While this approach is theoretically sound, in practice, the dual variables are not optimal from the beginning, but rather converge over time. Specifically, for the cost-constraint, the convergence is asymptotic. As a result, we find that cost-control is ineffective. In this work, we analyse the shortcomings of the optimal bidding formula and propose a modification that deviates from the theoretical derivation. We simulate various practical scenarios and study the cost-control behaviors of the two algorithms. Through a large-scale evaluation on the real-word data, we show that the proposed modification reduces the cost violations by 50%, thereby achieving a better cost-control than the theoretical bidding formula.

advertiser, bidding formula, constraint, (15 more...)

arXiv.org Artificial Intelligence

2409.03874

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
Europe > Germany > Berlin (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.50)

Industry:

Marketing (1.00)
Information Technology > Services (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)

Add feedback

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Blessing, Denis, Jia, Xiaogang, Esslinger, Johannes, Vargas, Francisco, Neumann, Gerhard

arXiv.org Machine LearningJun-11-2024

Monte Carlo methods, Variational Inference, and their combinations play a pivotal role in sampling from intractable probability distributions. However, current studies lack a unified evaluation framework, relying on disparate performance measures and limited method comparisons across diverse tasks, complicating the assessment of progress and hindering the decision-making of practitioners. In response to these challenges, our work introduces a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. Moreover, we study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose. Our findings provide insights into strengths and weaknesses of existing sampling methods, serving as a valuable reference for future developments. The code is publicly available here.

artificial intelligence, large-scale evaluation, machine learning, (13 more...)

arXiv.org Machine Learning

2406.07423

Country:

Europe > Germany (0.46)
Europe > United Kingdom (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Energy > Oil & Gas > Upstream (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift

Seligmann, Florian, Becker, Philipp, Volpp, Michael, Neumann, Gerhard

arXiv.org Artificial IntelligenceOct-24-2023

Bayesian deep learning (BDL) is a promising approach to achieve well-calibrated predictions on distribution-shifted data. Nevertheless, there exists no large-scale survey that evaluates recent SOTA methods on diverse, realistic, and challenging benchmark tasks in a systematic manner. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection containing challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large, convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over- or under-confident, providing further insight into the behavior of the methods. Further, we provide the first systematic evaluation of BDL for fine-tuning large pre-trained models, where training from scratch is prohibitively expensive. Finally, given the recent success of Deep Ensembles, we extend popular single-mode posterior approximations to multiple modes by the use of ensembles. While we find that ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, we also identify a failure mode of ensembles when finetuning large transformer-based language models. In this setting, variational inference based approaches such as last-layer Bayes By Backprop outperform other methods in terms of accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.

bayesian deep learning, distribution shift, large-scale evaluation, (1 more...)

arXiv.org Artificial Intelligence

2306.12306

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

Atzberger, Daniel, Cech, Tim, Scheibel, Willy, Trapp, Matthias, Richter, Rico, Döllner, Jürgen, Schreck, Tobias

arXiv.org Artificial IntelligenceJul-17-2023

Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impact the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the resulting layout. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.

large-scale evaluation, model and dimensionality reduction method, text spatialization, (1 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TVCG.2023.3326569

2307.1177

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (1.00)

Add feedback

Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation

Medić, Zoran, Šnajder, Jan

arXiv.org Artificial IntelligenceSep-12-2022

Recently introduced transformer-based article encoders (TAEs) designed to produce similar vector representations for mutually related scientific articles have demonstrated strong performance on benchmark datasets for scientific article recommendation. However, the existing benchmark datasets are predominantly focused on single domains and, in some cases, contain easy negatives in small candidate pools. Evaluating representations on such benchmarks might obscure the realistic performance of TAEs in setups with thousands of articles in candidate pools. In this work, we evaluate TAEs on large benchmarks with more challenging candidate pools. We compare the performance of TAEs with a lexical retrieval baseline model BM25 on the task of citation recommendation, where the model produces a list of recommendations for citing in a given input article. We find out that BM25 is still very competitive with the state-of-the-art neural retrievers, a finding which is surprising given the strong performance of TAEs on small benchmarks. As a remedy for the limitations of the existing benchmarks, we propose a new benchmark dataset for evaluating scientific article representations: Multi-Domain Citation Recommendation dataset (MDCR), which covers different scientific fields and contains challenging candidate pools.

information retrieval, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2209.05452

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Croatia > Zagreb County > Zagreb (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
(2 more...)

Add feedback